Opening Questions

  • How do we know if water quality meets environmental standards?
  • What sample size do we need for reliable biodiversity surveys?
  • How can we predict extreme climate events?

Today’s Journey

  1. Normal Distribution
    • Environmental patterns
    • Probability calculations
  2. Sampling Distributions
    • From samples to populations
  3. Central Limit Theorem
    • Making reliable predictions

Why This Matters

Common Environmental Applications

  • Water quality monitoring
  • Species population assessment
  • Climate variation analysis
  • Pollution level compliance

Common Confusions to Avoid

Watch Out For These!

  1. Population vs Sample
    • Population parameters (μ, σ) are usually unknown
    • We estimate them from sample statistics (\bar{x}, s)
    • Example: All possible stream temperatures vs our measurements
  2. Distribution Shape vs Mean
    • Same mean doesn’t mean same distribution
    • Need to consider spread and shape
    • Example: Two sites with same average temperature but different variability
  3. Sample Size Effects
    • Larger samples = Better estimates
    • But how large is “large enough”?
    • Depends on how variable your data is

Learning Outcomes

  • Understand what a (probability) distribution is:
    • the properties of a continuous distribution.
  • Use Normal Distribution to understand/describe data
    • Be able to standardise a Normal;
    • Calculate probabilities based on Normal Distribution using R.
  • Know that there are other continuous distributions useful in hypothesis testing.
  • Distinguish between population, sample and sampling distributions;
  • Distinguish between a standard deviation and standard error of the mean;
  • Describe the Central Limit Theorem;
  • Use R and Excel to calculate the standard error and probabilities associated with sampling distributions;

Types of data

  • Numerical
    • Continuous: yield, weight
    • Discrete: weeds per m^2
  • Categorical
    • Binary: 2 mutually exclusive categories
    • Ordinal: categories ranked in order
    • Nominal: qualitative data

Example

  • The gestation period (in days) for American Simmental cattle is distributed with mean 284.3 and standard deviation 5.52. How often is a calf born a week early?
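A quick way to answer this in R (a sketch; "a week early" is taken here to mean at least 7 days before the mean):

```r
# P(calf born at least a week early) = P(X <= 284.3 - 7)
# where X ~ N(284.3, 5.52^2)
p_week_early <- pnorm(284.3 - 7, mean = 284.3, sd = 5.52)
p_week_early # roughly 0.10, i.e. about 1 calf in 10
```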

Wray et al. 1987

Eryk - stock.adobe.com

What is a distribution

  • In our case we are generally referring to a distribution function
    • This is a function (or model) that describes the probability that a system will take on a value or set of values \{x\}
  • For any variable X, we describe probabilities by
    • Discrete variables: probability mass function P(X=x)
    • Continuous variables: probability density function f(x)
    • Discrete and Continuous variables: cumulative distribution function F(x) = P(X≤x)
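These three objects map directly onto R's d- and p- function families; a small illustration (the coin-toss binomial example is ours, not from the slides):

```r
# Discrete variable: probability mass function P(X = x)
# e.g. number of heads in 10 fair coin tosses
dbinom(4, size = 10, prob = 0.5) # P(X = 4)

# Continuous variable: probability density function f(x)
# note this is a density, NOT a probability
dnorm(284.3, mean = 284.3, sd = 5.52)

# Both: cumulative distribution function F(x) = P(X <= x)
pbinom(4, size = 10, prob = 0.5)
pnorm(284.3, mean = 284.3, sd = 5.52) # = 0.5 at the mean
```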

Environmental Data Example: Water Temperature

In environmental science, we often need to understand the pattern of measurements to make decisions. Let’s look at stream temperature monitoring:

Temperature Monitoring Background

  • Daily water temperature measurements follow patterns
  • Understanding these patterns helps protect aquatic life
  • We need to assess risks of extreme temperatures
Code
# Load the plotting packages used throughout this deck
library(ggplot2)
library(cowplot)

# Define parameters
temp_mean <- 22 # Mean temperature in °C
temp_sd <- 1.5 # Standard deviation in °C
thresh <- 24 # Environmental threshold

# Create temperature range for plotting
temp_range <- seq(temp_mean - 4 * temp_sd, temp_mean + 4 * temp_sd, length.out = 1000)
temp_df <- data.frame(temperature = temp_range)

# Create visualization
ggplot(temp_df, aes(x = temperature)) +
  stat_function(
    fun = dnorm, args = list(mean = temp_mean, sd = temp_sd),
    color = "blue"
  ) +
  geom_vline(xintercept = thresh, linetype = "dashed", color = "red") +
  annotate("text",
    x = thresh + 0.5, y = 0.2,
    label = "Environmental\nThreshold",
    color = "red", hjust = 0
  ) +
  labs(
    title = "Stream Water Temperature Distribution",
    subtitle = "Daily measurements follow a pattern we can describe",
    x = "Temperature (°C)",
    y = "Relative Frequency"
  ) +
  theme_cowplot()

This pattern in our data lets us:

  • Predict future temperatures
  • Assess risks to aquatic life
  • Plan monitoring strategies
  • Make management decisions

Understanding how to describe and work with these patterns is key to environmental science.

Properties of a Continuous Distribution

  • For any continuous distribution
    • There is an infinite number of possible values;
    • These values may be within a fixed interval. For example, male human heights (in cm) belong to [54.6,272].

Human height

  • Any specific value in a continuous distribution has probability 0. For example, the probability of measuring a Simmental cow at exactly 450 kg is zero: there are infinitely many possible weights just above and below 450 kg, so the probability of any single exact value is vanishingly small, effectively zero.
  • The probabilities must total 1 (the total area under the pdf is 1).
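Both properties are easy to check numerically in R (a sketch using the gestation distribution):

```r
# Total area under the pdf is 1
integrate(dnorm, lower = -Inf, upper = Inf,
          mean = 284.3, sd = 5.52)$value

# A single exact value has probability 0:
# P(X = 284.3) = F(284.3) - F(284.3) = 0
pnorm(284.3, 284.3, 5.52) - pnorm(284.3, 284.3, 5.52)
```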

The Normal Distribution

  • The Normal Distribution is super important because it occurs everywhere! It naturally describes many natural phenomena and is great for modelling the sample mean.
  • It is a symmetric, bell-shaped distribution with two parameters \mu and \sigma^2 such that:

X\sim{N(\mu,\sigma^2)}

The Standard Normal Curve

  • The standard normal curve is the one with mean = 0 and variance = 1; we usually call this variable Z

Z\sim{N(0,1)}

Code
# Create a sequence of values for the x-axis
x_values <- seq(-4, 4, by = 0.01)

# Create a data frame to hold these values
data_frame <- data.frame(x_values)

# Plot the standard normal curve
ggplot(data_frame, aes(x = x_values)) +
  stat_function(
    fun = dnorm, args = list(mean = 0, sd = 1),
    color = "blue"
  ) +
  labs(
    title = "Standard Normal Curve",
    x = "Z-Score",
    y = "Density"
  )

The General Normal Curve

  • Simmental cattle gestation times…
Code
# Define parameters (avoid shadowing the base functions mean() and sd())
gest_mean <- 284.3
gest_sd <- 5.52

# Create data and plot
x_values <- seq(gest_mean - 4 * gest_sd, gest_mean + 4 * gest_sd, length.out = 1000)
df <- data.frame(x = x_values)

ggplot(df, aes(x = x)) +
  stat_function(
    fun = dnorm, args = list(mean = gest_mean, sd = gest_sd),
    color = "blue"
  ) +
  labs(
    title = "Normal Distribution Curve",
    x = "Gestation Period (days)",
    y = "Density"
  )

The General Normal Distribution

If X\sim{N(\mu,\sigma^2)}

  • PDF

f(x | \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} for x \in (-\infty,\infty)

  • CDF

F(x)=P(X\le x)=\int_{-\infty}^x f(y)dy
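A quick numerical check that R's dnorm() and pnorm() implement exactly these formulas (a sketch using the gestation parameters):

```r
x <- 290
mu <- 284.3
sigma <- 5.52

# pdf computed by hand vs dnorm()
f_manual <- exp(-0.5 * ((x - mu) / sigma)^2) / (sigma * sqrt(2 * pi))
c(manual = f_manual, dnorm = dnorm(x, mu, sigma))

# cdf via numerical integration of the pdf vs pnorm()
c(integrated = integrate(dnorm, -Inf, x, mean = mu, sd = sigma)$value,
  pnorm = pnorm(x, mu, sigma))
```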

Types of Normal Probabilities

  • There are 3 types of probabilities that we are interested in:
    • Tail probabilities (lower and upper) = Cumulative probabilities
    • Interval probabilities;
    • Inverse probabilities.
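In R, all three types reduce to pnorm() and qnorm(); a sketch using the gestation example:

```r
mu <- 284.3
sigma <- 5.52

# Tail (cumulative) probabilities
pnorm(275, mu, sigma)                     # lower tail: P(X <= 275)
pnorm(290, mu, sigma, lower.tail = FALSE) # upper tail: P(X >= 290)

# Interval probability: P(280 <= X <= 285)
diff(pnorm(c(280, 285), mu, sigma))

# Inverse probability: the x with P(X <= x) = 0.95
qnorm(0.95, mu, sigma)
```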

Normal distribution in R

Types of Normal Probabilities in Environmental Science

Lower Tail: Early Warning Thresholds

Water Temperature Example: P(T\le 20) - Risk of cold stress

Gestation Example: P(X\le 275) - Early birth monitoring

Code
# Calculate probabilities
p_cold <- pnorm(20, temp_mean, temp_sd)
p_early <- pnorm(275, 284.3, 5.52)

# Create temperature plot
temp_plot <- ggplot(temp_df, aes(x = temperature)) +
  stat_function(
    fun = dnorm, args = list(mean = temp_mean, sd = temp_sd),
    color = "blue"
  ) +
  geom_area(
    data = subset(temp_df, temperature <= 20),
    aes(y = dnorm(temperature, temp_mean, temp_sd)),
    fill = "blue", alpha = 0.3
  ) +
  geom_vline(xintercept = 20, linetype = "dashed", color = "darkblue") +
  labs(
    title = "Water Temperature",
    subtitle = sprintf("P(T ≤ 20°C) = %.1f%%", p_cold * 100),
    x = "Temperature (°C)",
    y = "Density"
  )

# Create gestation plot
gest_df <- data.frame(
  days = seq(284.3 - 4 * 5.52, 284.3 + 4 * 5.52, length.out = 1000)
)

gest_plot <- ggplot(gest_df, aes(x = days)) +
  stat_function(
    fun = dnorm, args = list(mean = 284.3, sd = 5.52),
    color = "blue"
  ) +
  geom_area(
    data = subset(gest_df, days <= 275),
    aes(y = dnorm(days, 284.3, 5.52)),
    fill = "blue", alpha = 0.3
  ) +
  geom_vline(xintercept = 275, linetype = "dashed", color = "darkblue") +
  labs(
    title = "Gestation Period",
    subtitle = sprintf("P(X ≤ 275 days) = %.1f%%", p_early * 100),
    x = "Days",
    y = "Density"
  )

# Display plots side by side
gridExtra::grid.arrange(temp_plot, gest_plot, ncol = 2)
Code
# Print probabilities with context
cat("Environmental Implications:\n")
Environmental Implications:
Code
cat(sprintf("- %.1f%% chance of temperatures below cold stress threshold\n", p_cold * 100))
- 9.1% chance of temperatures below cold stress threshold
Code
cat(sprintf("- %.1f%% chance of early birth (before 275 days)\n", p_early * 100))
- 4.6% chance of early birth (before 275 days)

These lower tail probabilities help us:

  • Identify risks of extreme events
  • Plan monitoring and intervention strategies
  • Set appropriate warning thresholds
  • Make evidence-based management decisions

Upper Tail Probabilities: Critical Thresholds

Monitoring values above critical thresholds helps identify potential risks:

Environmental Management Examples:

  • Water temperature above stress threshold: P(T\ge 24)
  • Extended gestation period: P(X\ge 290)
Code
# Calculate probabilities
p_heat <- 1 - pnorm(24, temp_mean, temp_sd)
p_late <- 1 - pnorm(290, 284.3, 5.52)

# Plot temperature threshold
temp_plot <- ggplot(temp_df, aes(x = temperature)) +
  stat_function(
    fun = dnorm, args = list(mean = temp_mean, sd = temp_sd),
    color = "blue"
  ) +
  geom_area(
    data = subset(temp_df, temperature >= 24),
    aes(y = dnorm(temperature, temp_mean, temp_sd)),
    fill = "red", alpha = 0.3
  ) +
  geom_vline(xintercept = 24, linetype = "dashed", color = "red") +
  labs(
    title = "Water Temperature",
    subtitle = sprintf("P(T ≥ 24°C) = %.1f%%", p_heat * 100),
    x = "Temperature (°C)",
    y = "Density"
  ) +
  annotate("text",
    x = 25, y = 0.2,
    label = "Critical\nThreshold",
    color = "red"
  )

# Plot gestation threshold
gest_plot <- ggplot(gest_df, aes(x = days)) +
  stat_function(
    fun = dnorm, args = list(mean = 284.3, sd = 5.52),
    color = "blue"
  ) +
  geom_area(
    data = subset(gest_df, days >= 290),
    aes(y = dnorm(days, 284.3, 5.52)),
    fill = "red", alpha = 0.3
  ) +
  geom_vline(xintercept = 290, linetype = "dashed", color = "red") +
  labs(
    title = "Gestation Period",
    subtitle = sprintf("P(X ≥ 290 days) = %.1f%%", p_late * 100),
    x = "Days",
    y = "Density"
  ) +
  annotate("text",
    x = 291, y = 0.05,
    label = "Extended\nGestation",
    color = "red"
  )

# Display plots side by side
gridExtra::grid.arrange(temp_plot, gest_plot, ncol = 2)
Code
# Print management implications
cat("Management Implications:\n")
Management Implications:
Code
cat(sprintf("- %.1f%% risk of thermal stress for aquatic life\n", p_heat * 100))
- 9.1% risk of thermal stress for aquatic life
Code
cat(sprintf("- %.1f%% chance of extended gestation requiring monitoring\n", p_late * 100))
- 15.1% chance of extended gestation requiring monitoring

Risk Management Applications:

  1. Environmental threshold monitoring
  2. Early warning systems
  3. Resource allocation planning
  4. Intervention timing decisions

Normal Operating Ranges: Interval Probabilities

In environmental monitoring and animal health, we often need to know the probability of measurements falling within expected ranges:

Examples:

  • Stream temperature in optimal range: P(21\le T\le 23)
  • Normal gestation period: P(280\le X\le 285)

Code
# Calculate probabilities for both examples
p_temp_normal <- diff(pnorm(c(21, 23), temp_mean, temp_sd))
p_gest_normal <- diff(pnorm(c(280, 285), 284.3, 5.52))

# Create temperature plot
temp_plot <- ggplot(temp_df, aes(x = temperature)) +
  stat_function(
    fun = dnorm, args = list(mean = temp_mean, sd = temp_sd),
    color = "blue"
  ) +
  geom_area(
    data = subset(temp_df, temperature >= 21 & temperature <= 23),
    aes(y = dnorm(temperature, temp_mean, temp_sd)),
    fill = "green", alpha = 0.3
  ) +
  geom_vline(xintercept = c(21, 23), linetype = "dashed", color = "darkgreen") +
  labs(
    title = "Stream Temperature",
    subtitle = sprintf("Optimal Range: %.1f%% between 21-23°C", p_temp_normal * 100),
    x = "Temperature (°C)",
    y = "Density"
  )

# Create gestation plot
gest_plot <- ggplot(gest_df, aes(x = days)) +
  stat_function(
    fun = dnorm, args = list(mean = 284.3, sd = 5.52),
    color = "blue"
  ) +
  geom_area(
    data = subset(gest_df, days >= 280 & days <= 285),
    aes(y = dnorm(days, 284.3, 5.52)),
    fill = "green", alpha = 0.3
  ) +
  geom_vline(xintercept = c(280, 285), linetype = "dashed", color = "darkgreen") +
  labs(
    title = "Gestation Period",
    subtitle = sprintf("Normal Range: %.1f%% between 280-285 days", p_gest_normal * 100),
    x = "Days",
    y = "Density"
  )

# Display plots side by side
gridExtra::grid.arrange(temp_plot, gest_plot, ncol = 2)
Code
# Print implications
cat("Monitoring Implications:\n")
Monitoring Implications:
Code
cat(sprintf("- %.1f%% of temperature readings should fall in optimal range\n", p_temp_normal * 100))
- 49.5% of temperature readings should fall in optimal range
Code
cat(sprintf("- %.1f%% of births expected during normal period\n", p_gest_normal * 100))
- 33.2% of births expected during normal period
Code
cat("\nManagement Applications:\n")

Management Applications:
Code
cat("1. Setting monitoring frequency\n")
1. Setting monitoring frequency
Code
cat("2. Resource allocation planning\n")
2. Resource allocation planning
Code
cat("3. Early intervention thresholds\n")
3. Early intervention thresholds
Code
cat("4. Performance benchmarking\n")
4. Performance benchmarking

This analysis shows how interval probabilities help:

  • Define normal operating conditions
  • Set realistic expectations
  • Plan resource allocation
  • Design monitoring programs

Finding Critical Values for Management

When designing monitoring programs, we often need to find values that capture specific probabilities:

  • What temperature should trigger interventions? P(T \le x)=0.9
  • When should we flag delayed births? P(X \le x)=0.95
Code
# Calculate critical values
temp_90 <- qnorm(0.9, temp_mean, temp_sd)
gest_95 <- qnorm(0.95, 284.3, 5.52)

# Create temperature plot with critical value
temp_plot <- ggplot(temp_df, aes(x = temperature)) +
  stat_function(
    fun = dnorm, args = list(mean = temp_mean, sd = temp_sd),
    color = "blue"
  ) +
  geom_area(
    data = subset(temp_df, temperature <= temp_90),
    aes(y = dnorm(temperature, temp_mean, temp_sd)),
    fill = "purple", alpha = 0.3
  ) +
  geom_vline(xintercept = temp_90, linetype = "dashed", color = "purple") +
  annotate("text",
    x = temp_90, y = 0.2,
    label = sprintf("90th Percentile\n%.1f°C", temp_90),
    color = "purple", hjust = -0.1
  ) +
  labs(
    title = "Water Temperature Critical Value",
    subtitle = "90% of readings fall below this threshold",
    x = "Temperature (°C)",
    y = "Density"
  )

# Create gestation plot with critical value
gest_plot <- ggplot(gest_df, aes(x = days)) +
  stat_function(
    fun = dnorm, args = list(mean = 284.3, sd = 5.52),
    color = "blue"
  ) +
  geom_area(
    data = subset(gest_df, days <= gest_95),
    aes(y = dnorm(days, 284.3, 5.52)),
    fill = "purple", alpha = 0.3
  ) +
  geom_vline(xintercept = gest_95, linetype = "dashed", color = "purple") +
  annotate("text",
    x = gest_95, y = 0.05,
    label = sprintf("95th Percentile\n%.1f days", gest_95),
    color = "purple", hjust = -0.1
  ) +
  labs(
    title = "Gestation Period Critical Value",
    subtitle = "95% of births occur before this time",
    x = "Days",
    y = "Density"
  )

# Display plots side by side
gridExtra::grid.arrange(temp_plot, gest_plot, ncol = 2)
Code
# Print management implications
cat("Management Thresholds:\n")
Management Thresholds:
Code
cat(sprintf("- Consider interventions when temperature exceeds %.1f°C\n", temp_90))
- Consider interventions when temperature exceeds 23.9°C
Code
cat(sprintf("- Investigate if gestation exceeds %.1f days\n", gest_95))
- Investigate if gestation exceeds 293.4 days
Code
cat("\nApplications:\n")

Applications:
Code
cat("1. Setting monitoring thresholds\n")
1. Setting monitoring thresholds
Code
cat("2. Designing intervention protocols\n")
2. Designing intervention protocols
Code
cat("3. Resource planning\n")
3. Resource planning
Code
cat("4. Risk assessment")
4. Risk assessment

These critical values help establish evidence-based management protocols and early warning systems.

Connecting Probability to Sampling

We’ve seen how to:

  • Calculate various types of probabilities
  • Work with environmental thresholds
  • Make evidence-based decisions

But in real-world monitoring, we rarely know the true population parameters. Instead:

  • We take samples
  • Calculate sample statistics
  • Use these to make inferences

Key Questions

This leads us to two crucial questions:

  1. How do sample means behave?
  2. How reliable are our probability calculations with sample data?

The Central Limit Theorem will help us answer these questions…

Progress Check ✓

Let’s review what we’ve learned about probability calculations:

  • We can calculate different types of probabilities:
    • Lower tail: P(X \leq x) - early warning
    • Upper tail: P(X \geq x) - critical thresholds
    • Interval: P(a \leq X \leq b) - normal ranges
    • Inverse: Finding x for given probability
  • These help us:
    • Assess environmental risks
    • Set monitoring thresholds
    • Make evidence-based decisions
    • Plan interventions
  • Questions to consider:
    • How do these calculations change with sample data?
    • What happens to probabilities as sample size changes?

Example

  • Let’s return to our example for American Simmental cattle where X \sim N(284.3, 5.52^2),
  • What is the probability of a gestation time less than 275 days?

So we need to calculate the lower tail probability: P(X \le 275)

Code
# Calculate probability of early gestation
# P(X ≤ 275) where X ~ N(284.3, 5.52²)
pnorm(275, 284.3, 5.52)
[1] 0.04601526
  • One would expect around 5% of gestation times to be less than 275 days.
  • Question for you: why might this be important, and how can we use these results?

vxnaghiyev - stock.adobe.com

Back to the Standard Normal Curve

  • Sometimes it is useful to standardise “data” as it allows us to compare samples that are drawn from populations that may have different means and standard deviations.
  • Luckily for us we can standardise any general normal distribution X\sim{N(\mu,\sigma^2)} to a standard normal distribution Z\sim{N(0,1)}.
  • This was also useful as we could use a set of standard normal tables to calculate probabilities (before computers were readily available).

P(X \le x)=P\left(\frac{X-\mu}{\sigma} \le \frac{x-\mu}{\sigma}\right)=P\left(Z \le \frac{x-\mu}{\sigma}\right)

Standard Normal Curve

  • Example: if X\sim{N(10,9)} find P(X \le 14)

P(X \le x)=P\left(\frac{X-\mu}{\sigma} \le \frac{x-\mu}{\sigma}\right)=P\left(Z \le \frac{x-\mu}{\sigma}\right)

P(X \le 14)=P\left(\frac{X-10}{\sqrt9} \le \frac{14-10}{\sqrt9}\right)=P\left(Z \le \frac{4}{3}\right)

Code
# Calculate P(X ≤ 14) where X ~ N(10, 9)
# Method 1: Direct calculation
prob1 <- pnorm(14, 10, 3)

# Method 2: Using standardized value
prob2 <- pnorm(4 / 3, 0, 1)

# Method 3: Using standardized value (default parameters)
prob3 <- pnorm(4 / 3)

# Display results
c(direct = prob1, standardized = prob2, default = prob3)
      direct standardized      default 
   0.9087888    0.9087888    0.9087888 

Percentiles of the Standard Normal Curve

  • Within 1 standard deviation of the mean ≈ 68% of the data

Percentiles of the Standard Normal Curve

  • Within 2 standard deviations of the mean ≈ 95% of the data

Percentiles of the Standard Normal Curve

  • Within 3 standard deviations of the mean ≈ 99.7% of the data
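These three percentages follow directly from pnorm(); a quick check:

```r
# Probability of falling within k standard deviations of the mean
within_k <- sapply(1:3, function(k) pnorm(k) - pnorm(-k))
round(within_k, 3) # 0.683 0.954 0.997
```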

Not so Normal Distributions

  • Student T: The Student T distribution models a symmetric bell-shaped variable with thicker tails than a Normal.
  • We say the variable X \sim t_n, with n degrees of freedom.
  • It has an extra parameter n, the degrees of freedom, which is related to the sample size.
  • The T distribution is used for the one- and two-sample T-tests, which are really important in the next few weeks.
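The thicker tails are easy to see in R; a sketch comparing upper-tail probabilities:

```r
# P(T > 2): thicker tails give a larger tail probability
pt(2, df = 5, lower.tail = FALSE) # t with 5 degrees of freedom
pnorm(2, lower.tail = FALSE)      # standard normal

# With large df the t distribution approaches the standard normal
pt(2, df = 1000, lower.tail = FALSE)
```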

Not so Normal Distributions

  • The Chi-Squared distribution models a variable which can only take positive values and is skewed in distribution.
  • We say the variable X \sim \chi_n^2, with n degrees of freedom.
  • The Chi-Squared distribution is used for the Chi-Squared Test which you will cover in the next few weeks.
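The same p- and q- function pattern applies here; a sketch with 4 degrees of freedom:

```r
# Chi-squared takes only positive values and is right-skewed
qchisq(0.95, df = 4) # upper 5% critical value, about 9.49
pchisq(9.49, df = 4) # ~0.95, recovering that probability

# The density is 0 for negative values
dchisq(-1, df = 4)
```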

Sampling distributions

  • Rye grass root growth (in mg dry weight) follows the distribution X \sim N(300, 50^2).
    1. One measurement is taken: how likely is it that the dry weight exceeds 320 mg?
    2. 10 measurements are taken: how likely is it that the sample mean exceeds 320 mg?

SERHII BLIK - stock.adobe.com

Sampling distributions

  • Here, we are dealing with 2 distributions:
      1. Measurement: X \sim N(300,50^2)
      2. Sample Mean of 10 measurements: \overline X = \frac{1}{10}\Sigma_{i=1}^{10} X_i \sim ...

How does the sampling distribution occur?

  • http://onlinestatbook.com/stat_sim/sampling_dist/
  • We have a population X
    • We take a sample of size n and calculate the mean \overline x_1
    • We take another sample of size n and calculate the mean \overline x_2
    • We take another sample of size n and calculate the mean \overline x_3, and so on.
  • If we could sample all possibilities, the sampling distribution of \overline X = \frac{1}{n}\Sigma_{i=1}^{n} X_i would be the distribution of \{\overline x_1, \overline x_2, \overline x_3,...\}
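We can mimic this process by simulation (a sketch, reusing the rye grass parameters N(300, 50^2)):

```r
set.seed(1)
n <- 10

# Draw many samples of size n and keep each sample mean
sample_means <- replicate(2000, mean(rnorm(n, mean = 300, sd = 50)))

mean(sample_means) # close to the population mean, 300
sd(sample_means)   # close to 50 / sqrt(10), about 15.8
```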

Distribution for a sample mean

  • if X\sim{N(\mu,\sigma^2)}
  • then \overline X\sim{N(\mu,\frac{\sigma^2}{n})}
  • Note that we call
    • \sigma the standard deviation such that sd(X)=\sigma, and
    • \sigma/\sqrt n the standard error such that sd(\overline X)=\sigma/\sqrt n
  • The standard error is important for making inferences about the population, i.e. how close your sample mean \overline x is to the population mean \mu
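A small sketch of how the standard error shrinks with n while the standard deviation \sigma stays fixed:

```r
sigma <- 50
n <- c(5, 10, 30, 100)

# sd(X) is sigma regardless of n; sd(X-bar) = sigma / sqrt(n)
data.frame(n = n, standard_error = sigma / sqrt(n))
```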

Example

  • Rye grass root growth (in mg dry weight) follows the distribution X \sim N(300,50^2).
      1. One measurement is taken: how likely is it that the dry weight exceeds 320 mg?
      2. 10 measurements are taken: how likely is it that the sample mean exceeds 320 mg?

Example

    1. X = Rye grass root growth \sim N(300,50^2)

P(X>320) = P\left(\frac{X-\mu}{\sigma}>\frac{x-\mu}{\sigma}\right)=P\left(\frac{X-300}{50}>\frac{320-300}{50}\right) =P(Z > 0.4) =1-P(Z < 0.4) \approx 1-0.66 = 0.34

Code
1 - pnorm(0.4)
[1] 0.3445783

Example

    2. \overline X = mean Rye grass root growth of 10 measurements \sim N(300,\frac{50^2}{10})

P(\overline X>320) = P\left(\frac{\overline X-\mu}{\frac{\sigma}{\sqrt{n}}}>\frac{x-\mu}{\frac{\sigma}{\sqrt{n}}}\right)=P\left(\frac{\overline X-300}{\frac{50}{\sqrt{10}}}>\frac{320-300}{\frac{50}{\sqrt{10}}}\right) =P(Z > 1.26) =1-P(Z < 1.26) \approx 1-0.90 = 0.10

Code
# Calculate probability of sample mean exceeding 320mg
# P(X̄ > 320) where X̄ ~ N(300, 50²/10)
# Standardized to P(Z > 1.26)
1 - pnorm(1.26)
[1] 0.1038347
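The same answer can be computed directly from the sampling distribution, without rounding the z-score first (a sketch):

```r
# P(X-bar > 320) where X-bar ~ N(300, 50^2 / 10)
1 - pnorm(320, mean = 300, sd = 50 / sqrt(10))
# slightly below 0.1038 because the exact z is 1.2649, not 1.26
```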

The Central Limit Theorem in Environmental Monitoring

Why This Matters

In environmental science, we often face:

  • Non-normal data (e.g., pollution levels, species counts)
  • Need to aggregate multiple measurements
  • Want to make reliable inferences

The Central Limit Theorem (CLT) tells us:

  • Sample means follow a normal distribution
  • Regardless of the original distribution shape
  • Larger sample sizes = more normally distributed

Requirements for CLT

For reliable results, we need:

  1. Independent random samples
  2. A large enough sample size:
    • n > 30 for skewed data (e.g., pollution levels)
    • n > 15 for symmetric data (e.g., temperature readings)
  3. A finite variance

Environmental Applications

Let’s see this in action with environmental data…

Example: Water Quality Monitoring

Consider daily pollution measurements:

  • Often right-skewed (many low values, few high spikes)
  • Single readings can be misleading
  • Need to understand behavior of sample means

CLT in Environmental Monitoring

Let’s demonstrate how the CLT works with real environmental data:

Code
# Set random seed for reproducibility
set.seed(123)

# Define parameters
number_of_samples <- 1000
sample_sizes <- c(5, 10, 30, 50)
distributions <- list(
  "Normal" = rnorm,
  "Exponential" = rexp,
  "Chi-Squared (df = 2)" = function(n) rchisq(n, df = 2)
)

# Function to generate sample means
generate_sample_means <- function(sample_size, number_of_samples, dist_function) {
  sapply(1:number_of_samples, function(x) mean(dist_function(sample_size)))
}

# Generate sample means for all combinations
sample_means_list <- lapply(distributions, function(dist_function) {
  lapply(sample_sizes, generate_sample_means,
    number_of_samples = number_of_samples,
    dist_function = dist_function
  )
})

# Convert to data frame
sample_means_df <- do.call(rbind, lapply(names(distributions), function(dist_name) {
  do.call(rbind, lapply(1:length(sample_sizes), function(i) {
    data.frame(
      Distribution = dist_name,
      Sample_Size = sample_sizes[i],
      Sample_Mean = sample_means_list[[dist_name]][[i]]
    )
  }))
}))

Central Limit Theorem in Action

Environmental Monitoring Examples

We’ll demonstrate the CLT using three types of environmental data:

  1. Stream temperatures (normally distributed)
  2. Air pollution levels (right-skewed)
  3. Species counts (discrete data)

Visualizing the CLT

Let’s see how sample means behave as sample size increases:

Code
# Rename distributions to environmental context
library(dplyr) # provides mutate() and case_when()
sample_means_df <- sample_means_df %>%
  mutate(Distribution = case_when(
    Distribution == "Normal" ~ "Stream Temperature",
    Distribution == "Exponential" ~ "Air Pollution",
    Distribution == "Chi-Squared (df = 2)" ~ "Species Abundance",
    TRUE ~ Distribution
  ))

# Enhanced visualization
ggplot(sample_means_df, aes(x = Sample_Mean)) +
  # Add density estimate
  geom_density(color = "red", linewidth = 1) +
  # Add histogram with improved aesthetics
  geom_histogram(
    aes(y = after_stat(density)),
    bins = 30,
    fill = "steelblue",
    alpha = 0.7,
    color = "white"
  ) +
  # Facet by distribution type and sample size
  facet_grid(
    Distribution ~ Sample_Size,
    scales = "free",
    labeller = labeller(
      Distribution = c(
        "Stream Temperature" = "Temperature (°C)\nNormally Distributed",
        "Air Pollution" = "PM2.5 Levels\nRight-Skewed",
        "Species Abundance" = "Species Counts\nDiscrete Data"
      )
    )
  ) +
  # Improved labels
  labs(
    title = "Central Limit Theorem in Environmental Monitoring",
    subtitle = "Sample means approach normal distribution as sample size increases",
    x = "Sample Mean",
    y = "Density"
  ) +
  # Consistent theme
  theme_cowplot() +
  theme(
    panel.spacing = unit(1, "lines"),
    strip.text = element_text(face = "bold"),
    plot.title = element_text(face = "bold"),
    plot.subtitle = element_text(margin = margin(b = 10))
  )
Code
# Print environmental monitoring implications
cat("\nImplications for Environmental Monitoring:\n\n")

Implications for Environmental Monitoring:
Code
cat("1. Stream Temperature:\n")
1. Stream Temperature:
Code
cat("   - Even small samples (n=10) give reliable means\n")
   - Even small samples (n=10) give reliable means
Code
cat("   - Good for continuous monitoring programs\n\n")
   - Good for continuous monitoring programs
Code
cat("2. Air Pollution:\n")
2. Air Pollution:
Code
cat("   - Requires larger samples (n≥30) for reliable means\n")
   - Requires larger samples (n≥30) for reliable means
Code
cat("   - Important for regulatory compliance\n\n")
   - Important for regulatory compliance
Code
cat("3. Species Abundance:\n")
3. Species Abundance:
Code
cat("   - Needs n≥30 for normal approximation\n")
   - Needs n≥30 for normal approximation
Code
cat("   - Critical for biodiversity assessments\n")
   - Critical for biodiversity assessments

Key Points for Environmental Scientists

The CLT helps us:

  1. Design monitoring programs
    • Choose appropriate sample sizes
    • Set sampling frequencies
    • Balance cost and accuracy
  2. Make reliable inferences
    • Estimate population parameters
    • Calculate confidence intervals
    • Test hypotheses about means
  3. Ensure quality control
    • Set warning thresholds
    • Monitor system changes
    • Make evidence-based decisions

Looking Ahead: CLT and Hypothesis Testing

The CLT is fundamental to statistical inference because it tells us that:

  1. Sample Means are Normally Distributed
    • Even when original data isn’t normal
    • Enables use of z-tests and t-tests
    • Supports confidence interval calculations
  2. Standard Error Matters
    • Measures uncertainty in sample means
    • Decreases with larger sample sizes: SE = \frac{\sigma}{\sqrt{n}}
    • Helps determine required sample sizes
  3. Coming Up in Future Lectures
    • One-sample tests: Compare means to standards
    • Two-sample tests: Compare different treatments
    • ANOVA: Compare multiple groups

Example: When testing if stream temperatures exceed regulatory limits, we’ll use:

  • Sample means (normally distributed thanks to CLT)
  • Standard error (to assess uncertainty)
  • t-tests (based on normal distribution assumptions)
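As a preview, here is what such a test looks like in R, using simulated stream temperatures (illustrative values, not real monitoring data):

```r
set.seed(42)
# Hypothetical sample of 12 daily stream temperatures (°C)
temps <- rnorm(12, mean = 22.5, sd = 1.5)

# One-sample t-test: is the mean temperature above a 22 °C limit?
t.test(temps, mu = 22, alternative = "greater")
```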

Progress Check: Probability and CLT ✓

Let’s connect what we’ve learned:

  1. Probability Calculations
    • Lower tail P(X \leq x) - Early warning
    • Upper tail P(X \geq x) - Critical thresholds
    • Intervals P(a \leq X \leq b) - Normal ranges
  2. Sample Means (CLT)
    • Approach normal distribution
    • More reliable with larger samples
    • Enable statistical inference
  3. Applications
    • Design monitoring programs
    • Set evidence-based thresholds
    • Make reliable predictions

Key Achievement: You can now connect probability theory to practical environmental monitoring decisions! 🎯

Thanks!

This presentation is based on the SOLES Quarto reveal.js template and is licensed under a Creative Commons Attribution 4.0 International License.